Biostatistics For Dummies (Monika Wahi John Pezzullo)

You may also want to include separate fields to hold prefixes (Mr., Mrs., Dr., and so on) and suffixes

(Jr., III, PhD, and so forth).

Addresses should be stored in separate fields for street, city, state (or province), and ZIP code (or

comparable postal code).

Collecting categorical data in your research database

Setting up your data collection forms and database tables for categorical data requires more thought

than you may expect. You may assume you already know how to record and enter categorical data. You

just type in the values — such as “United States,” “nurse,” or “Stage I” — right? Wrong! (But wouldn’t

it be nice if it were that simple?) The following sections look at some of the issues you have to

address when storing categorical values as research data.

Carefully coding categories

The first thing you need to decide is how to code the categories. How are you going to store the values

in the research database? Do you want to enter the type of care provider as nurse, physician, or social

worker; or as N, P, or SW; or as 1 = nurse, 2 = physician, and 3 = social worker; or in some other

manner? Most modern statistical software can analyze categorical data with any of these

representations, but it is easiest for the analyst if you code the variables using numbers to represent the

categories. Software like SPSS, SAS, and R lets you attach a text label to each numerical code (for

example, labeling 1 so that it displays as Nurse), so you can store categories as numbers while still

showing what each code means on statistical output. In general, the best practice is to set coding

conventions, apply them consistently, and make sure the content and meaning of each variable are

documented. You can also attach variable labels, which describe the variable itself.
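
As a rough illustration of storing numbers while displaying text, here is a minimal Python sketch. The codes 1, 2, and 3 come from the example above; the names CAREGIVER_LABELS and label_for are hypothetical and are not part of SPSS, SAS, or R.

```python
# Hypothetical value labels for the Type of Caregiver variable.
# The codes follow the example in the text: 1 = nurse, 2 = physician,
# 3 = social worker.
CAREGIVER_LABELS = {1: "Nurse", 2: "Physician", 3: "Social worker"}


def label_for(code, labels=CAREGIVER_LABELS):
    """Return the display label for a stored numeric code."""
    if code not in labels:
        # An unknown code means the key is incomplete or the data are wrong.
        raise ValueError(f"Undocumented code: {code!r}")
    return labels[code]


stored_values = [1, 3, 2, 1]  # what is kept in the database
displayed = [label_for(code) for code in stored_values]  # what output shows
print(displayed)
```

Raising an error on an unknown code is a design choice made for this sketch: it surfaces a missing label right away instead of letting an unexplained number slip into the output.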

Nothing is worse than having to deal with a data set in which a categorical variable has been

stored with numerical codes, but there is no key to the codes and the person who created the data

set is no longer available. This is why maintaining a data dictionary — described later in this

chapter in “Creating a File that Describes Your Data File” — is a critical step for ensuring you

analyze your research data properly.
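
To make the idea of a key to the codes concrete, here is a minimal Python sketch of checking a data set against a data dictionary. The layout of DATA_DICTIONARY, the variable names caregiver_type and site, and the function undocumented_values are all assumptions made for illustration; they are not the format this chapter describes later.

```python
# Hypothetical data dictionary: one entry per variable, each with a
# description and the meaning of every allowed code.
DATA_DICTIONARY = {
    "caregiver_type": {
        "description": "Type of care provider",
        "codes": {1: "Nurse", 2: "Physician", 3: "Social worker"},
    },
}


def undocumented_values(rows, dictionary):
    """Collect values that have no entry in the data dictionary."""
    problems = {}
    for row in rows:
        for variable, value in row.items():
            entry = dictionary.get(variable)
            # Flag unknown variables as well as unknown codes.
            if entry is None or value not in entry["codes"]:
                problems.setdefault(variable, set()).add(value)
    return problems


rows = [{"caregiver_type": 1}, {"caregiver_type": 4}, {"site": 2}]
problems = undocumented_values(rows, DATA_DICTIONARY)
print(problems)
```

Here the code 4 and the variable site are both reported, because neither appears in the dictionary.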

Microsoft Excel doesn’t care whether you type a word or a number in a cell, which can create

problems when storing data. You can enter Type of Caregiver as N for the first subject, nurse for

the second, NURSE for the third, 1 for the fourth, and Nurse for the fifth, and Excel won’t stop

you or throw up an error. Statistical programs like R would consider each of these entries as a

separate, unique category. Even worse, you may inadvertently add a blank space in the cell before

or after the text, which will be considered yet another category. Case matters too: when software treats

uppercase and lowercase letters as different characters, inconsistent capitalization can affect queries. In Excel,

avoid using autocomplete, and enter all levels of categorical variables as numerical codes (which

can be decoded using your data dictionary).
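
The problem described above can be sketched in a few lines of Python. The entries mirror the Type of Caregiver example, plus one value with a stray leading space; the lookup table TEXT_TO_CODE and the function to_code are hypothetical names used only for illustration.

```python
from collections import Counter

# Inconsistent entries for nurses, including one with a leading space.
raw_entries = ["N", "nurse", "NURSE", "1", "Nurse", " nurse"]

# Counted as typed, every spelling variant becomes its own category.
raw_counts = Counter(raw_entries)

# Hypothetical lookup from cleaned-up text to the numeric code.
TEXT_TO_CODE = {
    "n": 1, "nurse": 1, "1": 1,
    "p": 2, "physician": 2, "2": 2,
    "sw": 3, "social worker": 3, "3": 3,
}


def to_code(entry):
    """Trim spaces, ignore case, and map the entry to its numeric code."""
    key = str(entry).strip().lower()
    return TEXT_TO_CODE.get(key)  # None marks an entry that needs review


clean_counts = Counter(to_code(entry) for entry in raw_entries)
print(len(raw_counts), "categories before cleaning")
print(len(clean_counts), "category after cleaning")
```

Cleaning after the fact only works for variants you anticipated, which is why entering numerical codes from the start is the safer habit.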

Dealing with more than two levels in a category

When a categorical variable has more than two levels (like the Type of Caregiver or Likert agreement